in silico Plants — Latest Matching Preprints

1

A guaranteed-convergence algorithm for coupled leaf photosynthesis–transpiration–stomatal conductance models

Masutomi, Y.; Kobayashi, K.

2026-07-08 plant biology 10.64898/2026.06.24.734164 medRxiv

Top 0.1%

13.1%

The photosynthesis-transpiration-stomatal conductance (An-E-gs) model framework is widely used for estimating photosynthesis, transpiration, and stomatal conductance in plants. The model equations are solved by numerical iteration, and the converged model values are deemed the solution. However, there has been no general guarantee that the iterative procedure converges to a solution or that the procedure leads to convergence. Building on the recent proof of the existence of a unique set of solutions, we herewith propose a numerical algorithm that is guaranteed to converge to the solution for the An-E-gs model framework. We first analytically prove that the proposed algorithm necessarily converges to a solution. We then demonstrate the convergence across contrasting combinations of leaf temperature, relative humidity, light, atmospheric CO2, and wind speed. We further demonstrate rapid convergence with the algorithm: no more than ca. 10 iterations for approximately 10⁻³ mol CO2 m⁻² s⁻¹ precision in net photosynthesis and no more than ca. 20 iterations for 10⁻⁷ mol CO2 m⁻² s⁻¹ precision. By guaranteeing convergence to the solution, this algorithm eliminates concerns about nonconvergence in leaf gas-exchange calculations and is expected to serve as a robust foundation for a range of studies from leaf-level gas exchange to global-scale carbon and water cycle dynamics.

2

Knowledge-guided Bayesian optimization using pre-trained LLMs speeds up the identification of superior genotypes from germplasm collection

Hamazaki, K.; Tsuda, K.

2026-07-02 bioinformatics 10.64898/2026.06.28.735149 medRxiv

Top 0.1%

9.0%

Background: Germplasm collections contain wide genetic diversity that is valuable for plant breeding, but conducting phenotypic evaluation for all genotypes in field trials is rarely feasible. Bayesian optimization offers a way to decide, season by season, which genotypes to cultivate in order to identify superior genotypes with fewer evaluations. However, standard Bayesian optimization commonly starts from randomly selected genotypes and mainly relies on surrogate models built from marker genotype information, while the text-based passport information that accompanies germplasm is not fully used. We examined whether pre-trained large language models can provide prior knowledge that improves these decisions in germplasm evaluation. Results: We constructed a large-language-model-guided Bayesian optimization framework that introduces large language models into two parts of the Bayesian optimization workflow. In zero-shot warmstarting, a large language model proposes initial genotypes using passport information such as cultivar name, country of origin, and subpopulation, optionally together with principal component scores derived from genome-wide single-nucleotide-polymorphism markers. In addition, we evaluated a large-language-model-based surrogate model that predicts phenotypic values for untested genotypes using in-context learning from previously evaluated genotypes. Using a rice germplasm panel and two target traits (seed number per panicle for maximization and protein content for minimization), we compared strategies. For seed number per panicle, zero-shot warmstarting with a general-purpose instruction-following model reduced the number of evaluated genotypes needed to reach the best genotype, whereas improvements were small for protein content. 
When genomic information was available, Gaussian-process-based Bayesian optimization was the strongest overall approach, while the large-language-model-based surrogate model outperformed random baselines and was competitive in some settings. When genomic information was not available, predictions based on passport information improved efficiency compared with fully random strategies. Conclusions: Pre-trained large language models can inject useful agronomic knowledge into Bayesian optimization for germplasm evaluation, particularly by improving early-stage genotype selection, and can also support optimization when genomic information is unavailable. As models better handle long genomic sequences together with passport information, large-language-model-guided Bayesian optimization may become a practical and explainable decision-support approach for agricultural optimization.

3

Data-informed modelling captures metabolic reprogramming and reveals branch points mediating cold stress response and growth trade-offs in rice

Soltani, F.; Moreira Machado, T.; Weder, J.-N.; Camborda de la Cruz, S.; Peleke, F. F.; Szymanski, J. J.; Töpfer, N.

2026-07-07 plant biology 10.64898/2026.07.07.736767 medRxiv

Top 0.1%

5.3%

Understanding stress-induced metabolic reprogramming in crop plants can inform breeding strategies and support the development of stress-resilient varieties. Genome-scale metabolic modelling has shown promise in elucidating network-level responses to changing environments, yet as an optimality-based approach it relies on the definition of an objective function, which is far from trivial for non-optimal conditions. To address this uncertainty, we used a time-resolved, data-informed metabolic model of rice (Oryza sativa L.) cold stress response as a test case, and explored two complementary approaches. We used sampling of the solution space combined with machine learning to identify reactions and pathways best characterizing the stress-induced metabolic shift, and used this information to perform Pareto analysis, placing growth and a stress-related objective in competition. This trade-off analysis identified key branch points in carbohydrate, amino acid, phenylpropanoid, nucleotide, and fatty acid biosynthesis, where resource reallocation towards stress-protection comes at the expense of growth. It further revealed differential flux modes across subcellular compartments and shifts in reducing equivalent provision as distinguishing features of the stress response. Together, these results provide a mechanistic understanding of the metabolic trade-offs and branch points governing cold stress response, and identify potential targets to optimize the cold response-growth trade-off in rice.

4

simSOMA: a cell-lineage based simulator of the somatic VAF spectrum in plants

Johannes, F.

2026-07-01 genomics 10.64898/2026.06.28.735079 medRxiv

Top 0.1%

2.4%

Plants accumulate somatic mutations during growth, and some of these mutations can spread from local cell lineages into branches, organs, or reproductive tissues. There is growing interest in these variants because they can underlie bud-sport traits in crops, contribute to within-organism somatic selection, and provide genetic variation that may be transmitted vegetatively or sexually to future generations. Recent genomic sequencing of bulk and layer-enriched plant tissues has shown that de novo somatic variants can generate complex variant allele-frequency (VAF) spectra. Interpreting these spectra requires understanding how mutations arising during mitotic cell division are filtered or amplified through shoot growth, branching, and organ formation. Because these processes interact across multiple scales, their combined effects are difficult to derive analytically. Here, we present simSOMA, a modular simulator that links rooted plant topologies to explicit cell-lineage dynamics. simSOMA models somatic mutation accumulation during stem-cell self-renewal in the shoot apical meristem, clonal expansion from the stem-cell niche to the meristem periphery, branch founding, and organ formation. Applying simSOMA across diverse growth scenarios revealed how individual processes can be isolated, varied, and combined to assess their effects on organ-level VAF spectra and among-organ variant sharing. The same simulated spectra can also be transformed to represent bulk or layer-enriched sampling and phased or unphased variant readouts, separating effects of developmental history from those introduced by tissue composition and allele counting. Because simSOMA is organized around modules with defined input-output interfaces, individual developmental components can be replaced or extended as new empirical information becomes available. 
This makes simSOMA a flexible tool for testing alternative models of somatic mosaicism in plants and for guiding the design and interpretation of VAF-based sequencing studies. The simulator is available at https://github.com/jlab-code/simSOMA.

5

An axiomatic approach to cultivar ranking in multi-environment trials

Kondratev, A. Y.; Ianovski, E.; Voronina, E.; Crossa, J.

2026-07-01 genetics 10.64898/2026.06.27.734959 medRxiv

Top 0.1%

2.1%

Multi-environment trials are central to cultivar evaluation because they reveal how candidate cultivars perform across locations, years, management conditions, and stress environments. The resulting yield matrix is a rich source of data on genotype-by-environment interaction, and a wide literature on estimation, decomposition, visualisation, and prediction of yield potential and stability has flourished. However, the ultimate question of which cultivar to recommend on the basis of such a matrix is often left implicit. The question is far from trivial, and in this paper we formulate cultivar recommendation as an axiomatic ranking problem. This framework is rich enough to encompass the existing literature on stability indices, as well as any other deterministic ranking procedure. We show that many commonly used stability-based procedures can violate minimal criteria of efficiency or consistency. The result of such violations is that a cultivar with uniformly high yield could be ranked below a cultivar with uniformly low yield, or the relative ranks of two cultivars could depend on whether or not a third cultivar is present in the matrix. Our results prove that under a small number of such criteria the space of admissible rules collapses to the family of power means and their limiting cases. If we further wish to allow multiplicative normalisation of yield, we are left with the geometric mean as the unique solution.
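
The role of the geometric mean can be illustrated with a small sketch. It is not the authors' code: the cultivar names, yields, and scaling constants are hypothetical, and it assumes that "normalisation" means rescaling each environment's yields by its own positive constant. Under that assumption, ranking by arithmetic mean (power mean with p = 1) can flip after rescaling, while ranking by geometric mean (the p = 0 limit) does not.

```python
import math

def power_mean(yields, p):
    """Power (generalised) mean of positive yields; p = 0 gives the geometric mean."""
    n = len(yields)
    if p == 0:
        return math.exp(sum(math.log(y) for y in yields) / n)
    return (sum(y ** p for y in yields) / n) ** (1.0 / p)

def rank(matrix, p):
    """Return cultivar names ordered from best to worst by power mean of yield."""
    return sorted(matrix, key=lambda name: power_mean(matrix[name], p), reverse=True)

# Hypothetical yields for two cultivars in two environments.
trial = {"A": [1.0, 7.0], "B": [3.0, 3.0]}

# Hypothetical per-environment scaling constants (e.g. yield relative to a local check).
scale = [10.0, 1.0]
rescaled = {name: [y * s for y, s in zip(ys, scale)] for name, ys in trial.items()}

arith_before, arith_after = rank(trial, 1), rank(rescaled, 1)
geo_before, geo_after = rank(trial, 0), rank(rescaled, 0)

print("arithmetic:", arith_before, "->", arith_after)
print("geometric: ", geo_before, "->", geo_after)
```

Rescaling multiplies every cultivar's geometric mean by the same factor, so the order is preserved; no such cancellation holds for the arithmetic mean.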

6

Beyond climatic drought indices : an hydraulic approach to quantifying forest water stress

Cochard, H.

2026-07-15 plant biology 10.64898/2026.07.13.738371 medRxiv

Top 0.2%

1.3%

The article introduces a new Forest Stress Index (ISF) based on a plant hydraulic modelling approach rather than classical climatic drought indices. Unlike other indices such as scPDSI or SPEI, ISF is grounded in xylem embolism dynamics simulated with the mechanistic SurEau model. The goal is to better link climatic anomalies to tree physiological functioning and mortality risk. ISF is defined using a locally adapted ideotype characterized by an optimal P50 value under a reference hydraulic functioning threshold. Simulations are performed across Europe and France using multiple climate datasets. The index is robust to model parameterization choices and assumptions about plant functional traits. Results show strong spatial and temporal consistency and significant correlations with SPEI and scPDSI. However, ISF more strongly highlights extreme drought years and exhibits a more skewed distribution. Future projections under SSP5-8.5 indicate a widespread increase in hydraulic stress with strong regional contrasts. Overall, ISF provides a mechanistic and complementary drought indicator more directly linked to forest mortality processes.

7

Enhancing predictive accuracy of yield traits in cassava through multi-trait genomic prediction

de Freitas, G. M.; Certuche, D. S.; Jannink, J.-L.; de Oliveira, E. J.; Garcia, A. A. F.

2026-07-06 genetics 10.64898/2026.07.01.735838 medRxiv

Top 0.2%

1.1%

Multi-trait genomic prediction offers a practical route to improve selection for costly, complex traits in clonally propagated crops such as cassava. In a Brazilian breeding panel of 1,078 cassava clones genotyped with 25,923 SNPs and phenotyped for six agronomic traits, we compared single-trait (ST) and multi-trait (MT) GBLUP models. Stage-wise mixed models produced BLUEs that fed into ST and MT-GBLUP. We tested five cross-validation schemes that mimic breeder realities: ST baseline (CV1); naive all-traits MT prediction for unphenotyped candidates (CV2); MT prediction using auxiliary trait phenotypes in the test set (CV3); and two sparse-phenotyping regimes with missingness by trait (CV4) or by clone (CV5) at 25%, 50%, and 75% levels. The main results were that, under the ST baseline (CV1), predictive ability ranged from 0.50 for DMC and 0.45 for FRY down to 0.13 for Le.Dis. A naive full MT model (CV2) performed approximately on par with ST-GBLUP. In contrast, MT designs (CV3) that included informative auxiliary traits, such as shoot yield and combinations with plant vigor and leaf disease severity, yielded small gains for DMC with predictive ability of approximately 0.51 (+2%), while FRY predictive ability increased to approximately 0.65 (+44%), accompanied by RMSE reductions for FRY up to approximately 13.5% (e.g. RMSE approximately 6.2). Sparse-phenotyping simulations (CV4/CV5) demonstrated that MT models sustain or even improve predictive ability under realistic missing-data regimes (PA ≈ 0.62-0.65). Selection concordance between MT and ST top-10% sets was generally high (>0.80), and MT configurations produced measurable improvements in expected selection response and genetic gain per cycle for several target traits. 
These results indicate that strategically implemented MT-GBLUP, using a small set of biologically and operationally informative auxiliary traits and optimized sparse phenotyping, can materially increase predictive accuracy and selection efficiency for economically critical cassava traits while reducing phenotyping burden.
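
The quoted percentage gains can be checked with a short calculation. This is a sketch, assuming the percentages are relative changes in predictive ability against the single-trait CV1 baselines reported in the abstract (0.50 for DMC, 0.45 for FRY); the function name is illustrative.

```python
def relative_gain(baseline, improved):
    """Percentage change of `improved` relative to `baseline`."""
    return 100.0 * (improved - baseline) / baseline

# Predictive abilities quoted in the abstract (CV1 baseline vs. CV3 multi-trait).
dmc_gain = relative_gain(0.50, 0.51)
fry_gain = relative_gain(0.45, 0.65)

print(f"DMC: {dmc_gain:+.1f}%  FRY: {fry_gain:+.1f}%")
```

Rounded to whole percentages, the two values agree with the +2% and +44% stated in the abstract.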

8

Large-scale analysis of optimisation methods for parameter estimation problems in the life sciences

Grein, S.; Penas, D. R.; Weindl, D.; Lakrisenko, P.; Banga, J. R.; Hasenauer, J.

2026-07-13 systems biology 10.64898/2026.07.11.737731 medRxiv

Top 0.2%

1.0%

Dynamic models are central to the computational life sciences but typically contain unknown parameters that must be inferred from experimental data. High-throughput measurements have made this task increasingly challenging, yielding high-dimensional search spaces and non-convex objectives with many local optima. This makes the choice of optimisation method critical. However, existing empirical studies either consider only a limited number of benchmark problems or only a narrow spectrum of local, global and hybrid optimisation methods. Here, we present a comprehensive benchmark of a broad range of optimisation methods on a curated collection of parameter estimation problems, comprising 990 method-problem-pairs executed on two independent supercomputing infrastructures. Our evaluation quantifies success rates, solution quality and computational cost, revealing characteristic strengths and limitations of each approach. We find that optimisation methods separated into clear performance tiers. Building on these results, we implemented a new hybrid strategy that combines enhanced scatter search with the best-performing local solver, which showed robust performance and improved on the other scatter-search variants we tested. Our results provide practical guidance for selecting optimisation methods and thereby support more accurate and reliable model calibration.

9

Text guidance is powerful but prompt-sensitive for weakly-supervised leaf symptom segmentation

Dubois, R.; Bousset, L.; Jumel, S.; Leclerc, M.; Parisey, N.; Joly, A.

2026-07-10 plant biology 10.64898/2026.07.10.737680 medRxiv

Top 0.2%

1.0%

Accurate segmentation of plant disease symptoms is essential for crop monitoring and phenotyping, yet it typically requires costly pixel-level annotations. Weakly supervised semantic segmentation (WSSS) alleviates this burden using image-level labels, but its performance depends on the quality of spatial priors such as class activation maps (CAMs). We investigate whether text-guided segmentation with the Segment Anything Model 3 (SAM3) can serve as an alternative weak supervision signal. Three pseudo-mask generation strategies are compared: (i) CAMs refined with SAM or SAM3, (ii) zero-shot text-guided SAM3, and (iii) a hybrid approach combining weak spatial cues with text prompts. The resulting pseudo-masks are used to train a DeepLabV3 model. Text guidance alone matches or outperforms conventional WSSS, achieving up to 0.46 IoU without spatial supervision and 0.61 IoU on a public dataset, although performance is sensitive to text prompt formulation. The hybrid strategy improves robustness, reaching 0.50 IoU on the primary dataset and 0.58 IoU on the additional dataset while reducing prompt sensitivity. Overall, text guidance is a promising alternative to conventional weak supervision, while hybrid approaches provide a more robust solution for plant disease segmentation.
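
For readers unfamiliar with the metric, intersection over union (IoU) for a single class can be computed as below. This is a generic sketch rather than the authors' evaluation code; the masks are hypothetical, and returning 1.0 when both masks are empty is an assumed convention.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0

# Hypothetical 2 x 3 symptom masks (1 = symptomatic pixel).
predicted = [[1, 1, 0],
             [0, 1, 0]]
annotated = [[1, 0, 0],
             [0, 1, 1]]

score = iou(predicted, annotated)
print(score)
```

Here two pixels are shared and four are marked in at least one mask, so the score is 2/4.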

10

Sunrise and sunset times are the main factors that determine the flowering time of photoperiod-sensitive sorghum

Clerget, B.; Sidibe, M.; vom Brocke, K.; Raharinivo, V.; Ortiz, D.; Trouche, G.

2026-07-08 plant biology 10.64898/2026.06.12.731875 medRxiv

Top 0.3%

0.9%

Crop photoperiodism models assume that flowering time is primarily controlled by daylength, yet many field observations contradict this view. We previously proposed an alternative framework integrating daily changes in sunrise and sunset times (dSR and dSS). Variety trials in Madagascar and in Argentina supported this concept: mid-late sorghum varieties from the northern hemisphere flowered late or very late when sown in November and December, consistent with the higher dSR/dSS values of the southern hemisphere summer. One Malian variety, sown monthly over six years in West Africa, exhibited high interannual variability in flowering time when sown between November and February. This revealed that up to four photoperiodic responses -- two quantitative and two qualitative, occurring at different times of the year -- may coexist within a single late photoperiod-sensitive variety. All responses use only dSR and dSS cues. The qualitative responses are triggered by an internal phasic coincidence, which is set by a linear relationship between dSR and dSS at the onset of plant photoperiod sensitivity, and between dSR+dSS at panicle initiation. The research model fitted data from 28 varieties grown in Mali well. It also accurately fitted the duration to PI observed in three varieties sown at tropical and temperate latitudes. Highlight: The seasonal photoperiodic adaptation of flowering time in sorghum plants may rely on several signal transduction pathways regulated by sunrise and sunset times rather than day length.

11

Neural Processes with Normalizing Flows for Wheat Height Estimation

Boss, M.; Volpi, M.; Roth, L.

2026-07-09 plant biology 10.64898/2026.06.24.734247 medRxiv

Top 0.3%

0.7%

In this work, we investigate modeling plant traits over time using neural processes, a class of machine learning models that learn distributions over functions. Plant growth is an inherently stochastic process with complex dynamics measured mostly at irregular times throughout the growing seasons. While individual trait trajectories may be simple, their distributions are shaped by complex interactions between genotype, environment, and other factors. In particular, we focus on plant height in wheat, a deceptively simple-looking trait with complex dynamics. To model these trajectory distributions, we evaluate neural processes and in particular extensions using normalizing flows, with different combinations of genotype and environmental covariates. For controlled evaluations, we generate synthetic wheat height trajectories calibrated against Swiss weather station records and the FIP1 dataset. To fully evaluate these trajectory distributions, we use signatures, vector representations of sequential data, together with Sig-MMD and the recently introduced CSig-MMD. Sig-MMD enables direct pathwise comparison of predicted and simulator trajectory distributions, while CSig-MMD focuses this comparison on the tail, including lodged trajectories. Together, these metrics allow us to assess whether the models capture the full distribution of growth trajectories, including rare outcomes.

12

GCBM-DCT-HV-Bio-NL-Grow-CHG-CSM-RHEC: A Unified Geometric, Biological, Causal, and Regenerative Framework for Mechanism-Aware Tissue and Connectome Modeling

Xu, T.; Hu, Z.; Sun, X.; Jin, L.; Xiong, M.

2026-06-29 bioinformatics 10.64898/2026.06.24.734320 medRxiv

Top 0.3%

0.6%

Modern biological prediction problems increasingly require models that go beyond Euclidean feature regression and local graph smoothing. Tissue, cellular, and connectome systems are nonlinear, geometry-dependent, intervention-sensitive, history-dependent, and subject to regenerative or homeostatic constraints. We propose GCBM/DCT/HV/Bio/NL/Grow/CHG/CSM/RHEC, a unified model for mechanism-aware biological prediction. The model integrates geometric connectome dynamics, differentiable charted tissue geometry, Hamiltonian latent transport, nonlinear biological kinetics, nested latent memory, continual growth without overwriting, causal hypergraph structure, causal structure modeling, and regenerative homeostatic error correction. Unlike Euclidean baselines, which treat observations as flat vectors, and local graph baselines, which use neighborhood smoothing without mechanistic structure, the proposed model represents biological states (Trapnell 2015) as coupled geometric, dynamical, causal, and regenerative objects. We evaluate the model on four synthetic toy studies (Toy A, B, C, and D) designed to reflect increasing biological complexity: local Euclidean structure, nonlinear mechano-chemical interaction, causal intervention response, and out-of-distribution regenerative shift. Compared with Euclidean and local graph baselines, the full model achieves the lowest mean squared error across all four toy studies. Relative to the Euclidean baseline, the full model reduces MSE by approximately 63.0%, 89.1%, 89.0%, and 90.9% on Toy A, Toy B, Toy C, and Toy D, respectively. These results support the value of integrating geometry, mechanism, causal structure, adaptive growth, and regenerative correction into a single predictive architecture (Figure 1).

13

Comp2GPR: A Sequence-Driven Framework for Gene-Protein-Reaction Rule Reconstruction

Castillo, S.

2026-06-26 bioinformatics 10.64898/2026.06.24.734174 medRxiv

Top 0.3%

0.6%

Accurate gene-protein-reaction (GPR) associations are essential for the predictive performance of genome-scale metabolic models (GEMs), as they define the mapping between genes, enzymes, and metabolic reactions. However, GPR rules are often incomplete or inconsistent due to limitations in annotation transfer and the ambiguous representation of multi-subunit protein complexes, leading to errors in downstream analyses such as gene essentiality prediction. Here, I introduce Comp2GPR, an automated pipeline for reconstructing GPR rules that integrates curated protein complex information with sequence-level evidence. Protein complexes were sourced from the Complex Portal and subjected to an AI-assisted curation workflow to retain only metabolically relevant assemblies. Comp2GPR combines deterministic sequence similarity mapping with explicit rule construction to generate Boolean GPR expressions that accurately represent obligate subunit relationships and isoenzyme redundancy. I evaluated the impact of the reconstructed GPR rules by integrating them into the Yeast9 metabolic model and comparing gene essentiality predictions with the original model. While global performance metrics remained largely unchanged, the updated model achieved a net improvement in prediction accuracy through gene-level corrections. Overall, Comp2GPR demonstrates that combining curated protein complex data with sequence-based validation improves the accuracy, interpretability, and reproducibility of GPR rules. The method provides a robust framework for enhancing metabolic model annotations and supports more reliable simulation-based analyses.

14

DeepPheno: A Deep Learning Framework for Linking Hyperspectral Imaging and SNP Genotypes in Lettuce

Okyere, F. G. G.; Mehrem, S. L.; Snoek, B. L.; Van den Ackerveken, G.; Abeln, S.

2026-07-10 plant biology 10.64898/2026.07.09.737449 medRxiv

Top 0.3%

0.6%

While whole-genome sequencing captures millions of single nucleotide polymorphisms (SNPs) and hyperspectral imaging (HSI) enables non-destructive plant phenotyping, integrating these modalities to link genotype to phenotype remains challenging due to their high dimensionality and non-linearity. This study presents DeepPheno, a deep learning framework that predicts SNP genotypes from HSI data, using model predictability as a proxy for genotype-phenotype association. HSI data were acquired from 194 lettuce genotypes under field conditions. HSI data patches (20 x 20 pixels x 224 spectral bands) were used to train a hybrid CNN to predict the variant of a specific SNP. The framework was validated on SNPs with known phenotypic effects (anthocyanin, leaf serration, pale pigmentation), achieving high predictive performance (AUC ranging from 0.806 to 0.935), whereas models trained on randomly shuffled labels performed at chance (mean AUC ≈ 0.51). Extending the workflow to 50 randomly selected putatively neutral SNPs, most yielded low predictability, but two showed high performance (AUC > 0.76), suggesting uncharacterized genotype-phenotype links. Explainable AI, including SHAP and Grad-CAM, identified relevant spectral and spatial features driving these predictions, particularly the green and red-edge wavelengths associated with pigment dynamics and leaf structure. These results establish a framework for understanding complex genotype-phenotype interactions in plants and extracting these links from HSI data without predefining the exact trait values. It provides an avenue for high-throughput trait discovery and description and extends the integration of image-based phenomics with plant genetics.

15

Measuring magnetic field effects in fluorescent flavoproteins via spin-dependent fluorescence intensity requires photoexcitation to be faster than spin-independent ground state recovery

Ross, B. L.; Lodesani, A.; Aiello, C. D.

2026-07-13 biophysics 10.64898/2026.07.08.737352 medRxiv

Top 0.3%

0.6%

Weak magnetic fields affect many biological processes across the tree of life, though the precise molecular sensors and pathways involved in such magnetoresponses remain mostly uncharacterized. Fluorescence is a useful tool for investigating magnetic field effects in flavoproteins, as their chromophores' fluorescence intensity can be shown to depend on the spin states of electronic radical pairs. Here, we describe a four-state ordinary differential equation model to understand what parameter sets result in fluorescence contrast between spin states in photocycles with singlet and triplet radical pairs. We conclude that only certain sets of parameters result in the fluorescence intensity being a good proxy measurement for singlet yield. In particular, we observe that the illumination intensity required to obtain fluorescence contrast depends on the rate of the slow spin-independent radical termination reactions that recover ground-state oxidized fluorophores. Moreover, to observe a magnetic field effect in fluorescence intensity when an external magnetic field modulates the singlet yield, the illumination intensity must be strong enough such that photoexcitation is not the rate-limiting step. This understanding suggests that flavoproteins that do not exhibit magnetic field effects in their fluorescence emission under certain experimental setups may still be sensitive to weak magnetic fields in terms of function, as magnetosensitivity in fluorescence depends strongly on illumination conditions.

16

Ecological connectivity modelling with WebAssembly

Southgate, A. J.; Redihough, J.

2026-07-09 ecology 10.64898/2026.07.08.737333 medRxiv

Top 0.3%

0.6%

Circuit theory has been successfully applied to ecological connectivity modelling, notably via the Circuitscape software, which is typically run locally on a laptop or via a server. For downstream geospatial web applications relying on connectivity analysis, backend infrastructure is required, which can be costly and require advanced data governance. Recent developments in WebAssembly now allow fast C++ or Rust code to be run directly in a sandboxed browser environment for edge computing. We present a WebAssembly/Rust toolset with a geospatial data pipeline and an efficient edge-computing implementation of connectivity analysis. This approach may be useful for geospatial modelling software where rasters and memory footprint are small enough for the browser context. Our results show that, as expected, Circuitscape solves 1000x1000 raster networks 1-2x faster but requires further file writes. Accounting for total program runtime, our web implementation can be faster for the given context.

17

Comparison of localGEBV and Optimal Haplotype Stacking Fitness Functions using a Novel R Package: HapSelect

Shaffer, W.; Papin, V.; Carter, Z.; Brunner, S. M.; Tong, J.; Villiers, K.; Robinson, H.; Voss-Fels, K.; Hayes, B. J.; Hickey, L.; Dinglasan, E.

2026-07-13 genetics 10.64898/2026.07.08.737160 medRxiv

Top 0.4%

0.5%

Haplotype-based breeding strategies have emerged as promising approaches to maximize long-term genetic gain by identifying complementary parental combinations while maintaining genetic diversity. However, these methods typically require phased genotypes and more intensive workflow pipelines and skillsets. We developed a novel local genomic estimated breeding value (localGEBV) fitness function with similar intent to the optimal haplotype stacking (OHS) framework fitness function and implemented both in the novel R package, HapSelect. Our aim was to evaluate whether phased haplotypes provide additional benefit over the more easily available dosage-based unphased genotypes in highly inbred crops. A subset of a bread wheat nested association mapping (NAM) population comprising 444 lines genotyped with 6,054 DArT-Seq markers was analysed. Marker effects were estimated using rrBLUP, localGEBV and haplotype effects were calculated across linkage disequilibrium-defined haploblocks, and genetic algorithms (GA) were used to identify optimal sets of 30 founders using either a localGEBV-derived fitness function with unphased dosage inputs or the OHS fitness function with phased inputs. Selected parental sets were compared with conventional truncation selection (TS) through 150 generations of forward simulation. The OHS fitness function achieved a marginally greater optimized ultimate GEBV than the localGEBV fitness function during GA optimization, with only 18 of the 30 selected founders overlapping between the two methods. Despite these differences, forward simulations demonstrated nearly identical long-term genetic gain for localGEBV and OHS-selected founders, with both approaches outperforming conventional truncation selection by maintaining greater genetic diversity and delaying the genetic plateau. The minimal difference between localGEBV and OHS is likely attributable to the high homozygosity of the population, where localGEBV and haplotype effects are nearly confounded. 
These results demonstrate that dosage-based localGEBV provides a practical alternative to phased haplotype approaches for parent selection in inbred crops, substantially simplifying genomic workflows while maintaining long-term breeding performance. Future work should evaluate these methods in more diverse inbred populations and outbred species, where great haplotypic diversity may increase the advantage of true haplotype-based optimizations.
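To make the contrast between truncation selection and block-wise parent selection concrete, here is a minimal Python sketch. It is not the HapSelect API or the authors' fitness function. It assumes that a line's local GEBV in a haploblock is the sum of dosage times marker effect over the markers in that block, and it uses a simplified set-level score (the best local GEBV per block among the chosen parents) as a stand-in for a stacking-style fitness. All names, marker effects, and genotypes are hypothetical.

```python
from typing import Dict, List, Sequence


def local_gebv(dosages: Sequence[int], effects: Sequence[float],
               blocks: Sequence[Sequence[int]]) -> List[float]:
    """Per-block sum of dosage * marker effect for one line (assumed definition)."""
    return [sum(dosages[m] * effects[m] for m in block) for block in blocks]


def truncation_selection(lines: Dict[str, Sequence[int]], effects: Sequence[float],
                         blocks: Sequence[Sequence[int]], n: int) -> List[str]:
    """Pick the n lines with the highest whole-genome GEBV (sum over blocks)."""
    totals = {name: sum(local_gebv(d, effects, blocks)) for name, d in lines.items()}
    return sorted(totals, key=lambda name: (-totals[name], name))[:n]


def set_fitness(selected: Sequence[str], lines: Dict[str, Sequence[int]],
                effects: Sequence[float], blocks: Sequence[Sequence[int]]) -> float:
    """Simplified stacking-style score: best local GEBV per block within the set."""
    per_line = [local_gebv(lines[name], effects, blocks) for name in selected]
    return sum(max(block_values) for block_values in zip(*per_line))


# Hypothetical toy data: 4 markers in 2 haploblocks, dosages coded 0/1/2.
effects = [1.0, -0.5, 0.25, 1.0]
blocks = [[0, 1], [2, 3]]
lines = {
    "A": [2, 0, 2, 0],  # strong in block 1
    "B": [2, 0, 0, 0],  # strong in block 1
    "C": [0, 2, 2, 2],  # strong in block 2, weaker overall
}

ts_pair = truncation_selection(lines, effects, blocks, n=2)
print(ts_pair, set_fitness(ts_pair, lines, effects, blocks))
print(["A", "C"], set_fitness(["A", "C"], lines, effects, blocks))
```

In this toy case truncation selection picks two lines that are strong in the same block, while the pair with complementary blocks scores higher on the set-level measure. A genetic algorithm would search over candidate sets using a score of this kind.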

18

Far-red timing uncovers cultivar-dependent yield and bolting responses in vertical-farm spinach (Spinacia oleracea L.)

McGovern, C.; Adrio, M.; Aliki, H.; Vichos, R.; Powell, W.; Sharma, R.

2026-07-13 plant biology 10.64898/2026.07.10.737849 medRxiv

Top 0.4%

0.4%

Far-red light (FR; 700-750 nm) is increasingly incorporated into controlled-environment lighting because it can improve photosynthetic efficiency when combined with comparatively shorter wavelengths. In long-day leafy crops such as spinach, however, FR may also promote the transition from vegetative to reproductive growth and thereby reduce marketable yield. Most studies have evaluated FR fraction, intensity or end-of-day exposure, whereas the developmental timing of FR has rarely been tested, particularly in spinach. Here, we evaluated six commercial spinach cultivars (Amador, Harp, Renegade, Responder, Rubino and Santa Cruz) in an indoor vertical farm under a common red-green-blue background (PPFD 260-264 µmol m⁻² s⁻¹, 12 h photoperiod, 24 °C) and four FR timing treatments: no FR (Control), FR throughout production (FullFR), FR during early development only (EarlyFR), and FR during late development only (LateFR). LateFR increased marketable fresh weight relative to Control (244 vs 224 g) and reduced flowering incidence, whereas far-red supplied during early development reduced fresh weight (158 g) and increased flowering. The magnitude of the timing response differed among cultivars: switching from EarlyFR to LateFR recovered 0 % fresh weight in Amador but 107 % in Renegade and Rubino, with the largest penalties occurring in otherwise bolt-resistant cultivars. EarlyFR also increased total chlorophyll and reduced the chlorophyll a:b ratio. These results show that FR response in spinach is strongly conditioned by developmental stage and cultivar. Although LateFR received more total far-red than EarlyFR, it behaved like the Control, indicating that the penalty was set by far-red timing rather than dose. Treatment differences in bolting and yield tracked an estimated phytochrome photostationary-state deficit during early development: a phytochrome-deficit model markedly outperformed a cumulative-dose model (ΔAIC = 441), and the deficit × cultivar interaction was strong (p < 0.001), with bolt-resistant cultivars losing most yield when far-red coincided with the early developmental window. We therefore propose that FR should be treated as a genotype-dependent management variable rather than as a fixed spectral input, with late application and bolt-resistant cultivars offering the most favourable combination for vertical-farm spinach production. Framed within the breeder's equation, the close match between the trial and production environment and the scope for shorter breeding cycles indoors suggest that genotype and far-red timing can be optimised jointly to accelerate genetic gain.
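For readers unfamiliar with the ΔAIC comparison reported above, the following Python sketch shows how such a difference is computed from the standard definition AIC = 2k − 2 ln L. The log-likelihoods and parameter counts are made-up placeholders, not values from the study, and the function names are illustrative only.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(aic_candidate: float, aic_best: float) -> float:
    """AIC difference of a candidate model relative to the best (lowest-AIC) model."""
    return aic_candidate - aic_best


def relative_likelihood(delta: float) -> float:
    """exp(-delta / 2): how plausible the candidate is relative to the best model."""
    return math.exp(-delta / 2)


# Hypothetical fits of two competing models to the same data.
aic_dose = aic(log_likelihood=-500.0, n_params=4)     # cumulative-dose model
aic_deficit = aic(log_likelihood=-450.0, n_params=5)  # deficit model

d = delta_aic(aic_dose, aic_deficit)
print(aic_dose, aic_deficit, d, relative_likelihood(d))
```

Differences above roughly 10 are conventionally read as essentially no support for the higher-AIC model, so a gap in the hundreds indicates a very clear preference.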

19

Novel Drosophila cis-regulatory elements can be uncovered by footprinting transcription factor binding sites in ATAC-seq data

Mei, C.; Ness, J.; Nakai, K.; Wunderlich, Z.

2026-06-25 genomics 10.64898/2026.06.22.733832 medRxiv

Top 0.5%

0.4%

Developmental processes depend on carefully coordinated gene expression. Expression is modulated by the binding of transcription factors (TFs) to cis-regulatory elements (CREs), like enhancers and promoters. Many computational and experimental approaches have been developed to find CREs, particularly enhancers, in the genome, each with strengths and caveats. Given the increasing availability of ATAC-seq data and methods to find TF binding therein, we hypothesized that we could use TF footprinting tools to find clusters of TF binding events within accessible chromatin that may act as CREs. Using the Drosophila anterior-posterior patterning network as a test bed, we used a digital genomic footprinting tool (DGT), TOBIAS, on previously published early embryo ATAC-seq data to characterize the TF footprint landscape of 16 TFs essential for embryonic patterning. Even in this system, with its extensive enhancer annotation, most footprinted TF binding sites lie outside of known enhancers, with intergenic and intronic regions hosting the highest TF footprint count, albeit at low density. To find potential novel enhancers, we identified high-density TF footprint clusters that are highly conserved and overlap with active enhancer histone mark signals. Five high-confidence candidates were selected for reporter assay validation, and all five were found to drive spatially patterned expression in the embryo. This study shows that even in a highly characterized system, the analysis of footprinted TF binding sites in ATAC-seq data can uncover new regulatory regions and suggests this approach may be helpful in using existing ATAC-seq data to find novel CREs.

ARTICLE SUMMARY: Given the increasing availability of ATAC-seq datasets, workflows to exploit the data to uncover new cis-regulatory elements (CREs), including enhancers, are valuable. Using early anterior-posterior patterning in the Drosophila embryo as a test case, we find that previously published transcription factor footprinting tools and ATAC-seq data can be analyzed to yield new candidate CREs. Experimental validation confirms the activity of selected candidate CREs, suggesting that existing data can be analyzed to find novel regulatory elements.
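The idea of finding high-density footprint clusters can be illustrated with a small Python sketch. This is not the authors' pipeline and does not use the TOBIAS interface. It assumes footprints on one chromosome are reduced to midpoint coordinates, groups neighbours that lie within a hypothetical `max_gap`, and keeps groups with at least a hypothetical `min_sites` footprints. Conservation and histone-mark filters are omitted.

```python
from typing import Iterable, List, Tuple


def cluster_footprints(positions: Iterable[int], max_gap: int,
                       min_sites: int) -> List[Tuple[int, int, int]]:
    """Group footprint midpoints into clusters.

    Consecutive footprints no more than max_gap bp apart share a cluster.
    Returns (start, end, n_sites) for clusters with at least min_sites footprints.
    """
    clusters: List[Tuple[int, int, int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the running cluster.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the final cluster.
    if len(current) >= min_sites:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


# Hypothetical footprint midpoints (bp) on a single chromosome.
midpoints = [2030, 100, 150, 900, 2000, 120, 2050, 2060]
print(cluster_footprints(midpoints, max_gap=50, min_sites=3))
# → [(100, 150, 3), (2000, 2060, 4)]
```

The isolated footprint at 900 bp is dropped, which mirrors the observation that many footprints outside known enhancers occur at low density.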

20

RootQuant: Automated Root Trait Quantification From Minirhizotron Images Using Deep Learning

Parth, K.; Varela, S.; Liu, Z.; Martini, K. M.; Rajurkar, A.; Allan, D.; McCoy, S.; Ruhter, J.; Walker, S.; Goldenfeld, N.; Leakey, A.

2026-07-08 plant biology 10.64898/2026.07.07.737053 medRxiv

Top 0.5%

0.4%

Quantifying root traits such as root length (RL) and root surface area (RSA) from minirhizotron imagery is a valuable approach for overcoming the phenotyping bottleneck that limits understanding and improvement of crop productivity, resource use efficiency and resilience in field experiments. However, current approaches remain labor-intensive, and deep learning (DL) methods suffer from limited generalization ability. We present RootQuant, an end-to-end DL model that simultaneously predicts RL and RSA directly from minirhizotron images using only whole-image trait values as supervision, thereby eliminating the need for pixel-level annotations. The model's generalization ability was evaluated across species and fine-tuning configurations. The practical applicability of the model was further assessed under field conditions by converting image-derived RL estimates into volumetric root length density (vRLD). Using 118,191 maize and soybean images collected between 2009 and 2020, RootQuant trained on both species achieved an R² of 0.90 and an RMSE of 2.9 mm for RL, and an R² of 0.88 and an RMSE of 4.2 mm² for RSA. The same mixed-species model generalized strongly across species, yielding an 8% relative improvement in R² and a 30% lower RMSE on maize compared with the same architecture trained on a single species and applied zero-shot. Image-derived RL predictions converted to vRLD showed the expected depth-dependent decline in vRLD, as was also found by coincident destructive quantification of roots washed out of soil cores. By providing a generalist backbone model trained on a large dataset from two major crop species, RootQuant enables high-throughput simultaneous estimation of two relevant root traits directly from raw imagery without task-specific fine-tuning, thereby accelerating in situ root system analysis and phenotyping applications.
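The accuracy figures above rest on two standard regression metrics. The Python sketch below shows how RMSE and R² are commonly computed and how a relative change between two models is expressed. It assumes R² is the coefficient of determination (1 − SS_res / SS_tot); the abstract does not state which definition was used. The numbers are illustrative placeholders, not data from the study.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the units of the trait (e.g. mm for root length)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def relative_change(new: float, old: float) -> float:
    """Fractional change of new relative to old (0.08 means 8% higher)."""
    return (new - old) / old


# Hypothetical per-image root lengths (mm): annotated values vs. model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]

print(rmse(observed, predicted), r_squared(observed, predicted))
```

Because RMSE keeps the trait's units while R² is unitless, reporting both gives a sense of absolute error and of the variance explained.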